Terrorist attacks, defined by the UMD START Consortium as "the threatened or actual use of illegal force and violence by a nonstate actor to attain a political, economic, religious, or social goal through fear, coercion, or intimidation," are a tragic and difficult reality of our modern world. Lives, families, and societies are demolished by these events, and their prevalence is a stain on world relations. The goal of this project is to describe worldwide trends in terrorist activity between 1970 and 2021 in order to inform the global community of the scope and impacts of such activity.
Throughout this tutorial, we will analyze trends in terrorist activity, including the relationships between attack types and their locations, between attack types and attack frequency, and between attack frequency and changes in Gross Domestic Product (GDP).
The Python libraries used in this project include:
- pandas, used to input and process tabular data
- openpyxl, an input engine for spreadsheet files required by pandas for read_excel
- numpy, used along with pandas to process large data
- country_converter, a library for converting country names to ISO3 codes for easier data merging
- logging, used to suppress messages from country_converter
- matplotlib, used to generate and clean plots describing trends we find
- geopandas, used to add a geometry to a pandas DataFrame in order to plot onto simple maps using longitude and latitude data
- sklearn, used to make models of data

The atypical packages we are using (openpyxl, country_converter, and geopandas) can be installed here in Jupyter:
%%capture
!pip3 install openpyxl country_converter geopandas
In the Data Collection phase, we collect data and "tidy" it to allow us to easily analyze it later.
Terrorist attack data was retrieved from the UMD START Consortium's Global Terrorism Database, which provides an .xlsx file containing detailed data on terrorist attacks between 1970 and 2021, including the date, time, and location of each event, the type(s) of attack, the estimated number of injuries and lives lost, and the estimated value of property damage.
GDP data was retrieved from the World Bank, which provides a .csv file containing GDP data for each country in each year.
In these first steps, we will import each dataset, modify the shape of each dataset in order to make it easier to process later, add or modify columns of each dataset, and combine the two datasets into one cumulative dataset.
We first import all necessary libraries:
import pandas as pd
import numpy as np
import country_converter as coco
import logging
import matplotlib.pyplot as plt
import geopandas
from sklearn import linear_model
And we will import the GDP dataset:
gdp_data = pd.read_csv("final_data/gdp.csv")
gdp_data.head()
| | Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.615084e+09 | 2.727933e+09 | 2.791061e+09 | 2.963128e+09 | 2.983799e+09 | 3.092179e+09 | 3.202235e+09 | 3.310056e+09 | 2.496648e+09 | NaN |
| 1 | Africa Eastern and Southern | AFE | GDP (current US$) | NY.GDP.MKTP.CD | 2.129059e+10 | 2.180847e+10 | 2.370702e+10 | 2.821004e+10 | 2.611879e+10 | 2.968217e+10 | ... | 9.730435e+11 | 9.839370e+11 | 1.003679e+12 | 9.242525e+11 | 8.823551e+11 | 1.020647e+12 | 9.910223e+11 | 9.975340e+11 | 9.216459e+11 | 1.082096e+12 |
| 2 | Afghanistan | AFG | GDP (current US$) | NY.GDP.MKTP.CD | 5.377778e+08 | 5.488889e+08 | 5.466667e+08 | 7.511112e+08 | 8.000000e+08 | 1.006667e+09 | ... | 1.990732e+10 | 2.014640e+10 | 2.049713e+10 | 1.913421e+10 | 1.811656e+10 | 1.875347e+10 | 1.805323e+10 | 1.879945e+10 | 2.011614e+10 | NaN |
| 3 | Africa Western and Central | AFW | GDP (current US$) | NY.GDP.MKTP.CD | 1.040414e+10 | 1.112789e+10 | 1.194319e+10 | 1.267633e+10 | 1.383837e+10 | 1.486223e+10 | ... | 7.275704e+11 | 8.207927e+11 | 8.649905e+11 | 7.607345e+11 | 6.905464e+11 | 6.837487e+11 | 7.416899e+11 | 7.945430e+11 | 7.844457e+11 | 8.358084e+11 |
| 4 | Angola | AGO | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.249982e+11 | 1.334016e+11 | 1.372444e+11 | 8.721929e+10 | 4.984049e+10 | 6.897276e+10 | 7.779294e+10 | 6.930910e+10 | 5.361907e+10 | 7.254699e+10 |
5 rows × 66 columns
Wow, that's ugly! It seems like each year is a column in this dataset, which makes it more difficult for us to model anything related to time, as we do not have any Year column to regress over. Hence, we have to melt (same as pivot_longer in R) all of these numerical columns into a single Year column while maintaining all other columns:
gdp_data = gdp_data.melt(id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"], var_name="Year", value_name="GDP")
gdp_data.head()
| | Country Name | Country Code | Indicator Name | Indicator Code | Year | GDP |
|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | GDP (current US$) | NY.GDP.MKTP.CD | 1960 | NaN |
| 1 | Africa Eastern and Southern | AFE | GDP (current US$) | NY.GDP.MKTP.CD | 1960 | 2.129059e+10 |
| 2 | Afghanistan | AFG | GDP (current US$) | NY.GDP.MKTP.CD | 1960 | 5.377778e+08 |
| 3 | Africa Western and Central | AFW | GDP (current US$) | NY.GDP.MKTP.CD | 1960 | 1.040414e+10 |
| 4 | Angola | AGO | GDP (current US$) | NY.GDP.MKTP.CD | 1960 | NaN |
Much better! Each row is now a single observation of one country's GDP in one year.
However, note that column names in pandas are stored as strings, so all of the columns we parsed as years are in string format. We can verify this by checking the data types of the columns using the dtypes attribute of DataFrames:
gdp_data.dtypes
Country Name       object
Country Code       object
Indicator Name     object
Indicator Code     object
Year               object
GDP               float64
dtype: object
Note that the Year column has the object data type rather than an integer type, like we would have hoped. Thus, we have to convert each entry in this column to an integer, which we can do using the .apply() method of Series and the built-in int() function, which converts its input to an integer:
gdp_data["Year"] = gdp_data["Year"].apply(int)
gdp_data.dtypes
Country Name       object
Country Code       object
Indicator Name     object
Indicator Code     object
Year                int64
GDP               float64
dtype: object
Now Year has type int64.
Now, we will finally add a gdp_change column indicating how much GDP changed from the current year to the next, which we will use to see whether a link exists between high terrorist activity in a given year and a subsequent drop in GDP over the following year:
gdp_data["gdp_change"] = gdp_data.groupby('Country Code')["GDP"].diff(periods=-1) * -1
gdp_data.head()
| | Country Name | Country Code | Indicator Name | Indicator Code | Year | GDP | gdp_change |
|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | GDP (current US$) | NY.GDP.MKTP.CD | 1960 | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | GDP (current US$) | NY.GDP.MKTP.CD | 1960 | 2.129059e+10 | 5.178878e+08 |
| 2 | Afghanistan | AFG | GDP (current US$) | NY.GDP.MKTP.CD | 1960 | 5.377778e+08 | 1.111108e+07 |
| 3 | Africa Western and Central | AFW | GDP (current US$) | NY.GDP.MKTP.CD | 1960 | 1.040414e+10 | 7.237596e+08 |
| 4 | Angola | AGO | GDP (current US$) | NY.GDP.MKTP.CD | 1960 | NaN | NaN |
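As a sanity check on the sign convention, here is a minimal sketch (with made-up GDP figures for two hypothetical countries) confirming that groupby(...).diff(periods=-1) * -1 yields next year minus current year within each country:

```python
import pandas as pd

# Hypothetical GDP series for two made-up countries (values are illustrative only)
toy = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Year": [1960, 1961, 1962, 1960, 1961],
    "GDP": [100.0, 110.0, 105.0, 50.0, 60.0],
})

# diff(periods=-1) computes current - next within each group;
# multiplying by -1 flips this to next - current
toy["gdp_change"] = toy.groupby("Country Code")["GDP"].diff(periods=-1) * -1

print(toy)
```

Each country's final year has no following year, so its gdp_change is NaN there.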
Now we will import the START terrorism dataset:
terrorism_data = pd.read_excel("final_data/globalterrorismdb_0522dist.xlsx")
terrorism_data.head()
| | eventid | iyear | imonth | iday | approxdate | extended | resolution | country | country_txt | region | ... | addnotes | scite1 | scite2 | scite3 | dbsource | INT_LOG | INT_IDEO | INT_MISC | INT_ANY | related |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 197000000001 | 1970 | 7 | 2 | NaN | 0 | NaT | 58 | Dominican Republic | 2 | ... | NaN | NaN | NaN | NaN | PGIS | 0 | 0 | 0 | 0 | NaN |
| 1 | 197000000002 | 1970 | 0 | 0 | NaN | 0 | NaT | 130 | Mexico | 1 | ... | NaN | NaN | NaN | NaN | PGIS | 0 | 1 | 1 | 1 | NaN |
| 2 | 197001000001 | 1970 | 1 | 0 | NaN | 0 | NaT | 160 | Philippines | 5 | ... | NaN | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN |
| 3 | 197001000002 | 1970 | 1 | 0 | NaN | 0 | NaT | 78 | Greece | 8 | ... | NaN | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN |
| 4 | 197001000003 | 1970 | 1 | 0 | NaN | 0 | NaT | 101 | Japan | 4 | ... | NaN | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN |
5 rows × 135 columns
Note that we have 135 columns in this dataset, the documentation for which is found in the GTD codebook.
We wish to combine this dataset with the GDP dataset, though we cannot do so as easily as we would like. We intend to merge the two datasets by matching rows on certain columns, but the country names in each dataset differ. Observe:
gdp_data[gdp_data["Country Name"] == "Korea, Rep."]["Country Name"].unique()
array(['Korea, Rep.'], dtype=object)
terrorism_data[terrorism_data["country_txt"] == "South Korea"]["country_txt"].unique()
array(['South Korea'], dtype=object)
Note that the GDP dataset uses the formal name "Korea, Rep." in reference to South Korea, whereas the terrorism dataset uses simply "South Korea". These inconsistencies are annoying, but we can instead take advantage of standardized country codes and merge on those. The GDP dataset already includes ISO3 codes for each entry, so we just have to add them to the terrorism dataset, which we will do using country_converter:
cc = coco.CountryConverter()
coco_logger = coco.logging.getLogger()
coco_logger.setLevel(logging.CRITICAL)
terrorism_data["country_code"] = cc.pandas_convert(series=terrorism_data["country_txt"], to='ISO3', not_found=np.nan)
terrorism_data.head()
| | eventid | iyear | imonth | iday | approxdate | extended | resolution | country | country_txt | region | ... | scite1 | scite2 | scite3 | dbsource | INT_LOG | INT_IDEO | INT_MISC | INT_ANY | related | country_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 197000000001 | 1970 | 7 | 2 | NaN | 0 | NaT | 58 | Dominican Republic | 2 | ... | NaN | NaN | NaN | PGIS | 0 | 0 | 0 | 0 | NaN | DOM |
| 1 | 197000000002 | 1970 | 0 | 0 | NaN | 0 | NaT | 130 | Mexico | 1 | ... | NaN | NaN | NaN | PGIS | 0 | 1 | 1 | 1 | NaN | MEX |
| 2 | 197001000001 | 1970 | 1 | 0 | NaN | 0 | NaT | 160 | Philippines | 5 | ... | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN | PHL |
| 3 | 197001000002 | 1970 | 1 | 0 | NaN | 0 | NaT | 78 | Greece | 8 | ... | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN | GRC |
| 4 | 197001000003 | 1970 | 1 | 0 | NaN | 0 | NaT | 101 | Japan | 4 | ... | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN | JPN |
5 rows × 136 columns
However, note that country_converter is not perfect, so we have some countries which could not be converted into ISO3 codes. We cannot do anything with these, so we will discard these data when dealing with GDP:
terrorism_data[terrorism_data["country_code"].isna()]["country_txt"].value_counts()
West Germany (FRG)                541
Yugoslavia                        203
Soviet Union                       78
East Germany (GDR)                 38
Serbia-Montenegro                  11
People's Republic of the Congo      4
International                       1
Name: country_txt, dtype: int64
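When GDP is required, these unmatched rows can simply be dropped. Here is a minimal sketch on a toy frame (the country names below just mirror the cases above):

```python
import pandas as pd
import numpy as np

# Toy stand-in for terrorism_data after the ISO3 conversion
toy = pd.DataFrame({
    "country_txt": ["Mexico", "Yugoslavia", "Japan"],
    "country_code": ["MEX", np.nan, "JPN"],
})

# Drop rows whose country could not be converted to an ISO3 code
with_codes = toy.dropna(subset=["country_code"])
print(with_codes["country_txt"].tolist())
```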
Now we will add a column with text representations of the coded attacktype1 column, which corresponds to the type of terrorist event that occurred:
atk_type_keys = ['Assassination/Assassination Attempt',
'Armed Assault',
'Bombing/Explosion',
'Hijacking',
'Hostage Taking (Barricading)',
'Hostage Taking (Kidnapping)',
'Facility/Infrastructure Attack',
'Unarmed Assault (inc. Chem., Bio., Rad. attacks)',
'Unknown']
terrorism_data["atk1_txt"] = [atk_type_keys[atk-1] for atk in terrorism_data["attacktype1"]]
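An equivalent and arguably more idiomatic way to express this mapping is Series.map with a dictionary keyed by the 1-indexed attack codes; here is a small sketch using a toy coded column in place of the real one:

```python
import pandas as pd

# Same 1-indexed mapping as atk_type_keys above, expressed as a dict for Series.map
atk_type_map = {i + 1: name for i, name in enumerate([
    'Assassination/Assassination Attempt',
    'Armed Assault',
    'Bombing/Explosion',
    'Hijacking',
    'Hostage Taking (Barricading)',
    'Hostage Taking (Kidnapping)',
    'Facility/Infrastructure Attack',
    'Unarmed Assault (inc. Chem., Bio., Rad. attacks)',
    'Unknown',
])}

# Toy coded column standing in for terrorism_data["attacktype1"]
codes = pd.Series([1, 3, 9])
labels = codes.map(atk_type_map)
print(labels.tolist())
```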
Now we can finally merge our two datasets, including only entries found in the terrorism dataset (as we do not need any GDP data outside of our bounds):
gdp_terrorism = terrorism_data.merge(gdp_data, how="left", left_on=["iyear","country_code"], right_on=["Year","Country Code"])
gdp_terrorism.head()
| | eventid | iyear | imonth | iday | approxdate | extended | resolution | country | country_txt | region | ... | related | country_code | atk1_txt | Country Name | Country Code | Indicator Name | Indicator Code | Year | GDP | gdp_change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 197000000001 | 1970 | 7 | 2 | NaN | 0 | NaT | 58 | Dominican Republic | 2 | ... | NaN | DOM | Assassination/Assassination Attempt | Dominican Republic | DOM | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 1.485500e+09 | 1.810000e+08 |
| 1 | 197000000002 | 1970 | 0 | 0 | NaN | 0 | NaT | 130 | Mexico | 1 | ... | NaN | MEX | Hostage Taking (Kidnapping) | Mexico | MEX | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 3.552000e+10 | 3.680000e+09 |
| 2 | 197001000001 | 1970 | 1 | 0 | NaN | 0 | NaT | 160 | Philippines | 5 | ... | NaN | PHL | Assassination/Assassination Attempt | Philippines | PHL | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 7.559180e+09 | 8.159065e+08 |
| 3 | 197001000002 | 1970 | 1 | 0 | NaN | 0 | NaT | 78 | Greece | 8 | ... | NaN | GRC | Bombing/Explosion | Greece | GRC | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 1.313986e+10 | 1.451886e+09 |
| 4 | 197001000003 | 1970 | 1 | 0 | NaN | 0 | NaT | 101 | Japan | 4 | ... | NaN | JPN | Facility/Infrastructure Attack | Japan | JPN | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 2.126092e+11 | 2.754262e+10 |
5 rows × 144 columns
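The key property of the left merge above is that every terrorism row is kept; rows without a matching (year, country) pair in the GDP data, such as those with a missing country_code, simply receive NaN in the GDP columns. A minimal sketch with toy frames:

```python
import pandas as pd
import numpy as np

# Toy stand-ins for the two datasets (values are illustrative only)
attacks = pd.DataFrame({
    "iyear": [1970, 1975],
    "country_code": ["DOM", np.nan],  # second row mimics a country with no ISO3 code
})
gdp = pd.DataFrame({
    "Year": [1970],
    "Country Code": ["DOM"],
    "GDP": [1.4855e9],
})

# how="left" keeps every attack row; unmatched rows get NaN in the GDP columns
merged = attacks.merge(gdp, how="left",
                       left_on=["iyear", "country_code"],
                       right_on=["Year", "Country Code"])
print(merged)
```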
In this stage of the data analysis pipeline, we plot different aspects of our data in order to better understand the data and discover potential trends which might exist within the data. We can map data, plot data over time, and perform rudimentary statistical analyses to better inform our hypothesis testing later.
We will first use geopandas to plot our START data on a world map in order to see how different types of attacks are distributed across the world. To do this, we first convert our data into a GeoDataFrame, which gives each datum a value for its geometry, in this case a point with the latitude and longitude of the event:
gdf = geopandas.GeoDataFrame(
gdp_terrorism, geometry=geopandas.points_from_xy(gdp_terrorism.longitude, gdp_terrorism.latitude))
gdf.head()
| | eventid | iyear | imonth | iday | approxdate | extended | resolution | country | country_txt | region | ... | country_code | atk1_txt | Country Name | Country Code | Indicator Name | Indicator Code | Year | GDP | gdp_change | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 197000000001 | 1970 | 7 | 2 | NaN | 0 | NaT | 58 | Dominican Republic | 2 | ... | DOM | Assassination/Assassination Attempt | Dominican Republic | DOM | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 1.485500e+09 | 1.810000e+08 | POINT (-69.95116 18.45679) |
| 1 | 197000000002 | 1970 | 0 | 0 | NaN | 0 | NaT | 130 | Mexico | 1 | ... | MEX | Hostage Taking (Kidnapping) | Mexico | MEX | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 3.552000e+10 | 3.680000e+09 | POINT (-99.08662 19.37189) |
| 2 | 197001000001 | 1970 | 1 | 0 | NaN | 0 | NaT | 160 | Philippines | 5 | ... | PHL | Assassination/Assassination Attempt | Philippines | PHL | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 7.559180e+09 | 8.159065e+08 | POINT (120.59974 15.47860) |
| 3 | 197001000002 | 1970 | 1 | 0 | NaN | 0 | NaT | 78 | Greece | 8 | ... | GRC | Bombing/Explosion | Greece | GRC | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 1.313986e+10 | 1.451886e+09 | POINT (23.76273 37.99749) |
| 4 | 197001000003 | 1970 | 1 | 0 | NaN | 0 | NaT | 101 | Japan | 4 | ... | JPN | Facility/Infrastructure Attack | Japan | JPN | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 2.126092e+11 | 2.754262e+10 | POINT (130.39636 33.58041) |
5 rows × 145 columns
We can now plot our data on a world map, which geopandas provides through its builtin naturalearth_lowres shapefile. We make a base plot from this and then plot each of our data points on the map, which geopandas handles for us given the geometry column added in the previous step. We refer to our atk1_txt column to allow geopandas to color-code the points based on the values of that column. We make the points semi-transparent (15% opacity) so we can visually estimate the density of large clusters of points, set the limits of the graph to span exactly the valid range of latitudes and longitudes, and add a legend to the right of the map:
ax = gdf.plot("atk1_txt",
ax= geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres')).plot(color='whitesmoke', edgecolor='black', figsize=(50,30)),
categorical=True,
legend=True,
cmap="Set1",
legend_kwds={'loc': 'center left',
'bbox_to_anchor':(1,0.5),
'markerscale':2,
'fontsize':20},
s=10,
alpha=0.15)
ax.set_xlim(-180, 180)
ax.set_ylim(-90, 90)
leg = ax.get_legend()
for lh in leg.legendHandles:
lh.set_alpha(1)
plt.show()
It is rather difficult to pick out details on this map due to how concentrated many of the attacks are. For example, Northern Ireland is quite densely filled with bombings, explosions, and armed assaults (see The Troubles), and Nicaragua is densely filled with armed assaults (see the Nicaraguan Revolution and the Contra War). However, it is difficult to see how attacks are distributed within each country, as we are looking at a world map rather than a more localized one. Hence, we can also make maps for individual regions and continents in order to more closely understand the scope of terrorist activity at a finer scale.
We will start with North America, which we will plot in exactly the same way as our world map but, by changing the x-limits and y-limits of what is plotted, we "zoom in" on this region:
def plot_data_with_limits(xrange,yrange):
ax = gdf.plot("atk1_txt",
ax= geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres')).plot(color='whitesmoke', edgecolor='black', figsize=(50,30)),
categorical=True,
legend=True,
cmap="Set1",
legend_kwds={'loc': 'center left',
'bbox_to_anchor':(1,0.5),
'markerscale':2,
'fontsize':20},
s=10,
alpha=0.15)
ax.set_xlim(xrange)
ax.set_ylim(yrange)
leg = ax.get_legend()
for lh in leg.legendHandles:
lh.set_alpha(1)
plt.show()
plot_data_with_limits((-170, -50),(0, 90))
From the above map, we can see that most armed assault attacks in Nicaragua occurred in the northwestern portion of the country, whereas many bombings were scattered across the Pacific coast.
We now plot South America:
plot_data_with_limits((-90, -30),(-60, 15))
Note that here, we can see a pattern of bombings across western Colombia and armed assaults focused around Ayacucho, Peru.
We now plot Europe:
plot_data_with_limits((-20, 50),(30, 75))
Note that most armed assaults in Northern Ireland happened around Belfast, and explosions tended to happen in the southeast portion of the country. Note that many attacks were also localized in Kosovo (see Kosovo War) and eastern Ukraine (see Russo-Ukrainian War pre-2022).
We now plot Africa:
plot_data_with_limits((-25, 55),(-40, 40))
We see here a lot of armed assault in northeastern Nigeria as well as bombings across southern Somalia. Also observe that most attacks in Egypt are concentrated about the Nile River, where most Egyptian population centers are located.
We now plot the Middle East and Southwestern Asia:
plot_data_with_limits((25, 80),(10, 43))
Note that the West Bank area in the southern Levant is filled with armed assaults, whereas most of the Mediterranean coast of Israel and the Gaza Strip is densely filled with bombings and explosions. Observe also that much of Iraq, Pakistan, Afghanistan, and Yemen have quite dense concentrations of explosions and bombings.
We now plot central and East Asia:
plot_data_with_limits((50, 180),(0, 80))
Note that around Manila in the Philippines, there is a high density of assassinations and attempted assassinations, likely related to the many political assassinations speculated to have been committed by allies of Ferdinand Marcos Sr. Such a high density of assassinations is unseen elsewhere on the map.
We finally plot Australia and Oceania:
plot_data_with_limits((80, 180),(-50, 10))
After using maps to visualize where attacks are located, we might wish to more rigorously see where most attacks are located, which we can do by filtering our dataframe for the countries which appear the most in our data:
gdf[gdf["country_txt"].isin(gdf["country_txt"].value_counts().head().index)].groupby("country_txt")["atk1_txt"].count().sort_index()
country_txt
Afghanistan    18920
Colombia        8915
India          13929
Iraq           27521
Pakistan       15504
Name: atk1_txt, dtype: int64
From now on, we will focus on these five above countries, as well as Nicaragua, the United Kingdom (which contains Northern Ireland), and the Philippines, all of which were briefly discussed above, for more country-specific analyses. Hence, we will filter our dataframe for only these countries:
gdf_filtered = gdf[gdf["country_txt"].isin(["Afghanistan", "Colombia", "India", "Iraq", "Pakistan", "Nicaragua", "United Kingdom", "Philippines"])]
gdf_filtered.head()
| | eventid | iyear | imonth | iday | approxdate | extended | resolution | country | country_txt | region | ... | country_code | atk1_txt | Country Name | Country Code | Indicator Name | Indicator Code | Year | GDP | gdp_change | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 197001000001 | 1970 | 1 | 0 | NaN | 0 | NaT | 160 | Philippines | 5 | ... | PHL | Assassination/Assassination Attempt | Philippines | PHL | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 7.559180e+09 | 8.159065e+08 | POINT (120.59974 15.47860) |
| 26 | 197001210001 | 1970 | 1 | 21 | NaN | 0 | NaT | 160 | Philippines | 5 | ... | PHL | Bombing/Explosion | Philippines | PHL | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 7.559180e+09 | 8.159065e+08 | POINT (121.05750 14.67428) |
| 39 | 197001310001 | 1970 | 1 | 31 | NaN | 0 | NaT | 160 | Philippines | 5 | ... | PHL | Unknown | Philippines | PHL | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 7.559180e+09 | 8.159065e+08 | POINT (120.33162 15.67505) |
| 96 | 197003000001 | 1970 | 3 | 0 | NaN | 0 | NaT | 160 | Philippines | 5 | ... | PHL | Bombing/Explosion | Philippines | PHL | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 7.559180e+09 | 8.159065e+08 | POINT (120.97867 14.59605) |
| 150 | 197003240001 | 1970 | 3 | 24 | NaN | 0 | NaT | 160 | Philippines | 5 | ... | PHL | Unknown | Philippines | PHL | GDP (current US$) | NY.GDP.MKTP.CD | 1970.0 | 7.559180e+09 | 8.159065e+08 | POINT (120.59194 15.15300) |
5 rows × 145 columns
Now we might wish to see how each country's GDP changes over time:
fig, ax = plt.subplots()
for idx, gp in gdf_filtered[["country_txt", "Year", "GDP"]].drop_duplicates().groupby("country_txt"):
gp.plot(x='Year', y='GDP', ax=ax, label=idx, logy=True)
ax.set_ylabel("Yearly GDP, modern $US (log scale)")
ax.set_title("GDP in 2022 USD in Selected Countries Over Time")
ax.legend(loc="center left", bbox_to_anchor=(1,0.5))
plt.show()
Observe that Iraq's GDP drops drastically in the early 90s (likely due to the Gulf War), and Nicaragua's GDP drops slightly in the late 80s, possibly due to the aforementioned Nicaraguan Revolution. We can further explore these trends as they relate to terrorist activity in our Hypothesis Testing phase.
Now let's plot year-after-year change in GDP over time in our selected countries:
fig, ax = plt.subplots()
for idx, gp in gdf_filtered[["country_txt", "Year", "gdp_change"]].drop_duplicates().groupby("country_txt"):
gp.plot(x='Year', y='gdp_change', ax=ax, label=idx)
ax.set_ylabel("Change in Yearly GDP (next - current), modern $US")
ax.set_title("Change in GDP in 2022 USD Over Time")
ax.legend(loc="center left", bbox_to_anchor=(1,0.5))
plt.show()
Now we wish to get a cumulative total of terrorist attacks in our eight countries to see if they line up with our theories about the GDP data:
filtered_attack_counts = gdf_filtered[["attacktype1","country_txt", "Year", "Country Code"]].groupby(["country_txt", "Country Code", "Year"]).agg({'count'}).unstack().fillna(0)
filtered_attack_counts.columns = filtered_attack_counts.columns.droplevel(0).droplevel(0)
filtered_attack_counts = filtered_attack_counts.reset_index().melt(id_vars=["country_txt", "Country Code"], var_name="Year", value_name="count")
filtered_attack_counts["Year"] = filtered_attack_counts["Year"].apply(int)
filtered_attack_counts
| | country_txt | Country Code | Year | count |
|---|---|---|---|---|
| 0 | Afghanistan | AFG | 1970 | 0.0 |
| 1 | Colombia | COL | 1970 | 1.0 |
| 2 | India | IND | 1970 | 0.0 |
| 3 | Iraq | IRQ | 1970 | 0.0 |
| 4 | Nicaragua | NIC | 1970 | 1.0 |
| ... | ... | ... | ... | ... |
| 395 | Iraq | IRQ | 2020 | 764.0 |
| 396 | Nicaragua | NIC | 2020 | 3.0 |
| 397 | Pakistan | PAK | 2020 | 294.0 |
| 398 | Philippines | PHL | 2020 | 294.0 |
| 399 | United Kingdom | GBR | 2020 | 90.0 |
400 rows × 4 columns
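The groupby/unstack/melt round trip above is one way to get zero-filled per-country, per-year counts; a shorter equivalent (sketched here on toy events, not our real data) uses groupby(...).size() and fills in the missing (country, year) pairs via unstack/stack:

```python
import pandas as pd

# Toy events: the country and year of each attack (illustrative only)
events = pd.DataFrame({
    "country_txt": ["A", "A", "B", "A"],
    "Year": [1970, 1970, 1971, 1972],
})

# size() counts rows per (country, year); unstack + stack with fill_value=0
# inserts explicit zero counts for pairs absent from the data
counts = (events.groupby(["country_txt", "Year"]).size()
                .unstack(fill_value=0)
                .stack()
                .rename("count")
                .reset_index())
print(counts)
```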
And we can plot count over time:
_, ax = plt.subplots()
for idx, gp in filtered_attack_counts.drop_duplicates().groupby("country_txt"):
gp.plot(x='Year', y='count', ax=ax, label=idx)
ax.set_ylabel("Amount of Terrorist Attacks in Given Year")
ax.set_title("Amount of Terrorist Attacks in Selected Countries Over Time")
ax.legend(loc="center left", bbox_to_anchor=(1,0.5))
plt.show()
Observe that the vast majority of terrorist attacks in Iraq, Afghanistan, Pakistan, India, and the Philippines have happened within about the past 15-20 years, which, for Iraq specifically, falls outside the range of its large GDP drop. We can confirm or reject any possible links between terrorist activity and GDP in Hypothesis Testing, though.
Now let's see if there may be trends regarding which attack types happened more frequently in certain years:
from matplotlib.colors import rgb2hex
cmap = plt.get_cmap("Set1")
def colors_at_breaks(cmap, breaks):
return [rgb2hex(cmap(bb)) for bb in breaks]
breaks = [n/9+1/18 for n in range(9)]
colors = colors_at_breaks(cmap, breaks)
attack_type_counts = gdf_filtered[["atk1_txt","country_txt", "Year"]].groupby(["atk1_txt", "Year"]).agg({'count'}).unstack().fillna(0)
attack_type_counts.columns = attack_type_counts.columns.droplevel(0).droplevel(0)
attack_type_counts = attack_type_counts.reset_index().melt(id_vars="atk1_txt", var_name="Year", value_name="count")
attack_type_counts["Year"] = attack_type_counts["Year"].apply(int)
_, ax = plt.subplots()
for (idx, gp), color in zip(attack_type_counts.drop_duplicates().groupby("atk1_txt"), colors):
gp = pd.DataFrame(gp)
gp.plot('Year', 'count', ax=ax, label=idx, c=color)
ax.legend(loc="center left", bbox_to_anchor=(1,0.5))
ax.set_ylabel("Amount of Terrorist Attacks in Given Year")
ax.set_title("Amount of Types of Terrorist Attacks Over Time")
plt.show()
Now that we better understand the data, we can move to Hypothesis Testing to more formally confirm or reject our ideas.
In this phase of the data science pipeline, we use various models and statistical tests to both verify trends we might have observed in our data exploration phase as well as extend those trends to predict outside of our dataset.
We will first test if there are links between change in GDP and the amount of terrorist attacks we see within our eight selected countries. We can first explore these trends by scattering GDP with respect to terrorist attacks in each of our countries and using a regression model to see if a trend exists within each country. We can make each scatterplot as normal, then we can fit a linear regression model to each subset of the data using sklearn, which we can then plot on the same axes as we made our scatterplot.
Each model produces an r$^2$ value, which represents the proportion of variance in the target variable that can be accounted for by our independent variables. Generally, changing model parameters in a manner that increases the r$^2$ value improves how well the model fits the data, i.e., how "good" it is.
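To make the r$^2$ interpretation concrete, here is a minimal sketch on synthetic data (generated here, not drawn from our datasets) checking sklearn's score() against the textbook definition r$^2$ = 1 - SS_res/SS_tot:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is roughly linear in x with Gaussian noise
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + rng.normal(0.0, 2.0, size=20)

model = LinearRegression().fit(x, y)

# sklearn's score() returns r^2 on the given data
r2_sklearn = model.score(x, y)

# Textbook definition: 1 - (residual sum of squares / total sum of squares)
pred = model.predict(x)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(round(r2_sklearn, 4))
```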
Below, we merge our selected countries data to include both attack counts and GDP, and for each of our selected countries, we make a scatter plot and fit a Scikit-Learn Linear Regression model to all non-missing datapoints. We then plot this line on the same graph and save our r$^2$ values in a dictionary where it can be referenced later by the country whose model it belongs to.
attack_counts_with_gdp = filtered_attack_counts.merge(gdp_data, how="left", left_on=["Year","Country Code"], right_on=["Year","Country Code"])
r_2_values = {}
for index, gp in attack_counts_with_gdp.groupby("country_txt"):
ax = gp.plot.scatter("count", "gdp_change", figsize=(10,8))
ax.set_xlabel("Amount of Terrorist Attacks in Given Year")
ax.set_ylabel("Change in Yearly GDP (next - current), modern $US")
ax.set_title("Change in GDP by Amt. of Terr. Attacks in " + index)
non_na_counts = []
non_na_gdp = []
for a, b in zip(gp["count"].values.reshape(-1, 1),gp["gdp_change"].values.reshape(-1, 1)):
if not np.isnan(a) and not np.isnan(b):
non_na_counts.append(a)
non_na_gdp.append(b)
clf = linear_model.LinearRegression()
clf.fit(non_na_counts, non_na_gdp)
predicted = clf.predict(non_na_counts)
ax.plot(non_na_counts,predicted, c="red")
r_2 = clf.score(non_na_counts,non_na_gdp)
r_2_values[index] = r_2
Now we can see our r$^2$ values:
for key, value in r_2_values.items():
print(f"The r^2 value for the {key} model relating change in GDP with amount of terrorist attacks is {(value):.7f}, \n\t\
so {(value*100):.5f}% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.\n")
The r^2 value for the Afghanistan model relating change in GDP with amount of terrorist attacks is 0.0339466, so 3.39466% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.
The r^2 value for the Colombia model relating change in GDP with amount of terrorist attacks is 0.0422305, so 4.22305% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.
The r^2 value for the India model relating change in GDP with amount of terrorist attacks is 0.1908505, so 19.08505% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.
The r^2 value for the Iraq model relating change in GDP with amount of terrorist attacks is 0.0000125, so 0.00125% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.
The r^2 value for the Nicaragua model relating change in GDP with amount of terrorist attacks is 0.0454709, so 4.54709% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.
The r^2 value for the Pakistan model relating change in GDP with amount of terrorist attacks is 0.1337640, so 13.37640% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.
The r^2 value for the Philippines model relating change in GDP with amount of terrorist attacks is 0.0748030, so 7.48030% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.
The r^2 value for the United Kingdom model relating change in GDP with amount of terrorist attacks is 0.0138532, so 1.38532% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.
Note that for some of these countries, there exists a positive correlation between terrorist activity and yearly change in GDP, which we would not expect. Hence, we might suspect that there is some lurking variable that affects both change in GDP and amount of terrorist activity. One such variable might be Year, so we'll repeat our above steps while also including Year in our linear models.
r_2_values = {}
for index, gp in attack_counts_with_gdp.groupby("country_txt"):
    # Keep only years with both an attack count and a GDP change value
    gp = gp.dropna(subset=["count", "gdp_change"])
    X = gp[["count", "Year"]]
    y = gp["gdp_change"].values.reshape(-1, 1)
    clf = linear_model.LinearRegression()
    clf.fit(X, y)
    [[m_count, m_year]] = clf.coef_
    [b] = clf.intercept_
    print(f"{index} Change in GDP = {m_count:.0f} * attack_count + {m_year:.0f} * year - {-b:.0f}")
    r_2_values[index] = clf.score(X, y)
Afghanistan Change in GDP = -1367563 * attack_count + 56786062 * year - 112027805758
Colombia Change in GDP = -32586933 * attack_count + 169264141 * year - 326041133226
India Change in GDP = 28019464 * attack_count + 3276383273 * year - 6483061436543
Iraq Change in GDP = -2621742 * attack_count + 259035383 * year - 511309926848
Nicaragua Change in GDP = -914302 * attack_count + 8054741 * year - 15810674834
Pakistan Change in GDP = 5335389 * attack_count + 221041197 * year - 435923784511
Philippines Change in GDP = -8748330 * attack_count + 469776536 * year - 928267738752
United Kingdom Change in GDP = -235260828 * attack_count + 109292141 * year - 132560598083
for key, value in r_2_values.items():
    print(f"The r^2 value for the {key} model relating change in GDP with amount of terrorist attacks and Year is {(value):.7f}, \n\t\
so {(value*100):.5f}% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.\n")
The r^2 value for the Afghanistan model relating change in GDP with amount of terrorist attacks and Year is 0.4245499, so 42.45499% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.
The r^2 value for the Colombia model relating change in GDP with amount of terrorist attacks and Year is 0.0548931, so 5.48931% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.
The r^2 value for the India model relating change in GDP with amount of terrorist attacks and Year is 0.2562162, so 25.62162% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.
The r^2 value for the Iraq model relating change in GDP with amount of terrorist attacks and Year is 0.0061370, so 0.61370% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.
The r^2 value for the Nicaragua model relating change in GDP with amount of terrorist attacks and Year is 0.0898135, so 8.98135% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.
The r^2 value for the Pakistan model relating change in GDP with amount of terrorist attacks and Year is 0.1713876, so 17.13876% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.
The r^2 value for the Philippines model relating change in GDP with amount of terrorist attacks and Year is 0.2696958, so 26.96958% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.
The r^2 value for the United Kingdom model relating change in GDP with amount of terrorist attacks and Year is 0.0139395, so 1.39395% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.
Our models are not perfect, but we can now see that generally, keeping other factors constant, if the amount of terrorist attacks in a given year increases, GDP will increase by a lesser amount in that year. Note, however, that Year is a much better predictor. Looking at Afghanistan, for example, only 3.39466% of the variance in year-to-year change in GDP can be attributed to the amount of terrorist attacks alone, whereas 42.45499% can be attributed to either Year or the amount of attacks, implying that in Afghanistan, Year is a more reliable predictor of change in GDP than the amount of terrorist attacks. We can still work on improving these models: if we can eliminate as many lurking variables for amount of terrorist attacks as we can, we will have a more accurate model, though this would require more data.
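One quick way to check whether adding Year genuinely improves a model, rather than merely inflating r^2 by adding another predictor, is adjusted r^2, which penalizes extra features. A minimal sketch using the Afghanistan r^2 values reported above; note that the sample size n = 50 is an assumption for illustration, not the exact number of usable years:

```python
def adjusted_r2(r2, n, p):
    """Adjusted r^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Afghanistan: attacks-only model vs. attacks + Year model (r^2 values from above);
# n = 50 yearly observations is an assumed, illustrative sample size.
single = adjusted_r2(0.0339466, n=50, p=1)
double = adjusted_r2(0.4245499, n=50, p=2)
# Even after the extra-predictor penalty, the two-feature model
# still explains far more of the variance.
```

If `double` exceeds `single` by a wide margin, as it does here, the improvement is not just an artifact of adding a second feature.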
Now we will pivot to assessing whether certain types of attacks have become more or less frequent over time. We can repeat similar steps as above, though using Year as the independent variable and the count of each category of attack as our dependent variable. Recall our plot of terrorist attack categories over time from above. We wish to see whether there exists a linear relationship between Year and the yearly count of any of these attack types, which would let us see which types of attacks are becoming more or less common.
r_2_values = {}
for index, gp in attack_type_counts.groupby("atk1_txt"):
    clf = linear_model.LinearRegression()
    clf.fit(gp["Year"].values.reshape(-1, 1), gp["count"].values.reshape(-1, 1))
    [[m_year]] = clf.coef_
    [b] = clf.intercept_
    print(f"Amount of {index} Attacks = {m_year:.5f} * year - {-b:.0f}")
    r_2 = clf.score(gp["Year"].values.reshape(-1, 1), gp["count"].values.reshape(-1, 1))
    r_2_values[index] = r_2
Amount of Armed Assault Attacks = 24.76444 * year - 48960
Amount of Assassination/Assassination Attempt Attacks = 5.79495 * year - 11353
Amount of Bombing/Explosion Attacks = 69.33689 * year - 137322
Amount of Facility/Infrastructure Attack Attacks = 4.62134 * year - 9144
Amount of Hijacking Attacks = 0.22425 * year - 443
Amount of Hostage Taking (Barricading) Attacks = 0.29176 * year - 574
Amount of Hostage Taking (Kidnapping) Attacks = 8.44190 * year - 16706
Amount of Unarmed Assault (inc. Chem., Bio., Rad. attacks) Attacks = 0.51827 * year - 1025
Amount of Unknown Attacks = 9.31523 * year - 18468
for key, value in r_2_values.items():
    print(f"The r^2 value for the {key} model relating amount of that type of terrorist attack with Year is {(value):.7f}, \n\t\
so {(value*100):.5f}% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.\n")
The r^2 value for the Armed Assault model relating amount of that type of terrorist attack with Year is 0.5684934, so 56.84934% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.
The r^2 value for the Assassination/Assassination Attempt model relating amount of that type of terrorist attack with Year is 0.2994787, so 29.94787% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.
The r^2 value for the Bombing/Explosion model relating amount of that type of terrorist attack with Year is 0.5176269, so 51.76269% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.
The r^2 value for the Facility/Infrastructure Attack model relating amount of that type of terrorist attack with Year is 0.5104329, so 51.04329% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.
The r^2 value for the Hijacking model relating amount of that type of terrorist attack with Year is 0.3731653, so 37.31653% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.
The r^2 value for the Hostage Taking (Barricading) model relating amount of that type of terrorist attack with Year is 0.2058557, so 20.58557% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.
The r^2 value for the Hostage Taking (Kidnapping) model relating amount of that type of terrorist attack with Year is 0.5940840, so 59.40840% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.
The r^2 value for the Unarmed Assault (inc. Chem., Bio., Rad. attacks) model relating amount of that type of terrorist attack with Year is 0.2960495, so 29.60495% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.
The r^2 value for the Unknown model relating amount of that type of terrorist attack with Year is 0.3877592, so 38.77592% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.
Note that generally, among each category of terrorist attack, there is unfortunately a positive correlation between Year and the amount of that category of terrorist attack. For example, we can expect about 25 more Armed Assault attacks, almost 6 more Assassinations/Attempts, and nearly 70 more Bombing/Explosion attacks in each subsequent year. Note also that the r^2 values are generally high; more than half (51.76%) of the year-to-year variance in the amount of Bombings and Explosions can be attributed to simply the year itself. Again referring to our plot above, we can observe sharp increases in terrorist attacks and activity post-9/11, implying that, especially recently, terrorist activity has unfortunately started to increase drastically with time, a quite sobering thought.
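To make this trend concrete, the fitted Bombing/Explosion line printed above can be turned into a rough predictor. The coefficients below are the rounded values from our model output, so this is only a ballpark sketch, and extrapolating a linear fit beyond the 1970-2021 data range should be done cautiously:

```python
# Rounded coefficients from the Bombing/Explosion model printed above
m_year = 69.33689
b = -137322.0

def predicted_bombings(year):
    """Linear-trend estimate of the yearly count of Bombing/Explosion attacks."""
    return m_year * year + b

# One year past the data range; treat this as a rough trend estimate only
estimate_2022 = predicted_bombings(2022)
```

Under this linear trend, each passing year adds roughly 69 Bombing/Explosion attacks to the expected yearly count.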
Through our data exploration and analysis, we can observe that:
- attack counts alone explain only a small share (roughly 0% to 19%) of the variance in year-to-year change in GDP for the countries we modeled;
- adding Year as a predictor substantially improves several of these models, suggesting lurking variables affect both change in GDP and amount of terrorist activity; and
- nearly every category of terrorist attack has become more frequent over time, with Year alone explaining roughly 21% to 59% of the variance in yearly counts.
Using this information, we conclude that terrorist activity has a relatively small but not insignificant impact on GDP. If we are to continue to productively improve as a global community, we must reduce the amount of terrorist attacks happening in the world by seeking peace whenever possible. This analysis provides interesting insight into the impacts of terrorism on the world. Although country-specific GDP is not currently taking a particularly large hit as a result of increased terrorist activity, we must work towards a more peaceful world before the impact of war and terror becomes too detrimental to world economies and societies. Loss of life, liberty, and public infrastructure are painful impacts of terrorist activity, and we as a world will be truly better if we can overcome this plague.